From Checkbox to Textbox

Capturing Nuanced Public Opinion with Large Language Models

Laurence-Olivier M. Foisy

Université Laval

Hubert Cadieux
Étienne Proulx
Camille Pelletier
Yannick Dufresne

June 28, 2025

The Challenge of Open-Ended Data

  • Open-ended questions offer unparalleled nuance but are notoriously difficult to analyze at scale.
  • This has led to a trade-off: the qualitative richness of text vs. the quantitative scalability of closed-ended items.
  • Can we resolve this long-standing methodological tension?

Using Large Language Models

  • LLMs present a potential solution, capable of understanding and categorizing text.
  • However, the literature raises valid concerns about transparency, reproducibility, and accuracy (the “black box” problem).
  • Our work addresses these concerns by testing whether LLMs can serve as a reliable tool for survey research.

Question

  • Question: Can LLMs measure the same latent constructs as traditional closed-ended questions?

Survey Design

  • Design: A survey experiment with two conditions:
    • Group 1: Traditional closed-ended (Likert) questions
    • Group 2: The same questions, asked in open-ended form
  • Questions:
    • 7 Socioeconomic
    • 1 Vote Intention
    • 7 Environmental
    • 5 Immigration Attitudes
  • Sample characteristics:
    • 1,685 respondents in the Open-ended group with 1,237 complete responses
    • 1,687 respondents in the Closed-ended group with 1,580 complete responses
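
The sample figures above imply different completion rates in the two groups. A minimal arithmetic sketch, using only the four counts reported on this slide (the dictionary layout and the names are illustrative):

```python
# Counts taken from the slide; everything else is illustrative naming.
groups = {
    "open-ended": {"respondents": 1685, "complete": 1237},
    "closed-ended": {"respondents": 1687, "complete": 1580},
}

# Completion rate = complete responses / respondents assigned to the group.
rates = {name: g["complete"] / g["respondents"] for name, g in groups.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
# open-ended: 73.4%
# closed-ended: 93.7%
```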

Our Method: A Two-Stage LLM Approach

  • We developed a novel, transparent, and scalable coding process.
  • Stage 1 (Prompt Generation): A more capable model (Gemini 2.5 Pro) generates a custom prompt for each survey item.
  • Stage 2 (Response Coding): A faster, cheaper model (Gemini 2.0 Flash Lite) uses these prompts to code thousands of responses.
  • Consensus Mechanism: Each response is coded by 10 parallel instances, and the modal code is kept to improve coding reliability.
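
The consensus step can be sketched as a majority vote over repeated codings of one response. This is a minimal illustration, not the authors' implementation: the function names, the tie-breaking rule, and the agreement share are assumptions added here.

```python
from collections import Counter


def modal_code(codes):
    """Return the most frequent code among repeated codings of one response.

    Ties go to the code that appears first in the list. This tie rule is an
    assumption; the slides do not say how ties are resolved.
    """
    if not codes:
        raise ValueError("need at least one coding")
    counts = Counter(codes)
    top = max(counts.values())
    for code in codes:  # first code that reaches the top count wins
        if counts[code] == top:
            return code


def agreement_share(codes):
    """Fraction of codings that match the modal code (1.0 means unanimous)."""
    winner = modal_code(codes)
    return codes.count(winner) / len(codes)


# Ten hypothetical codings of a single open-ended response.
runs = [6, 6, 5, 6, 6, 6, 5, 6, 6, 6]
consensus = modal_code(runs)
share = agreement_share(runs)
```

A low agreement share could be used to flag responses for manual review, although the slides do not describe such a step.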

Two-Step LLM Prompting Process

  • Shell Script Orchestrator
  • Step 1: Create Custom Prompts (Gemini 2.5 Pro), producing 20 prompts
  • Step 2: Code Responses (Gemini 2.0 Flash Lite)
    • 20 R processes (1 per variable)
    • 10 parallel requests per R process
    • Modal response (consensus)
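
The fan-out in Step 2 can be illustrated with a thread pool. The pipeline described here runs R processes under a shell script; the Python below is a stand-in sketch, and `request_code` is a hypothetical stub that replaces the real model call so the example runs offline.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

N_REQUESTS = 10  # parallel requests per response, as in the diagram


def request_code(open_response):
    """Hypothetical stand-in for one Gemini 2.0 Flash Lite call.

    A real implementation would send the system message and the filled user
    template to the model and return its numeric reply.
    """
    text = open_response.lower()
    return 6 if text.startswith(("oui", "yes")) else 1


def code_with_consensus(open_response, n=N_REQUESTS):
    """Send n identical requests in parallel and keep the modal code."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        codes = list(pool.map(request_code, [open_response] * n))
    return Counter(codes).most_common(1)[0][0]


responses = ["Oui, absolument", "No, scrap it"]
coded = [code_with_consensus(r) for r in responses]
```

Threads suit this kind of workload because each request mostly waits on the network; the same structure would apply per variable, one worker group per survey item.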

Prompt

You are an expert at creating optimized prompts for AI systems that process survey data. Your task is to generate a specialized prompt for coding open-ended survey responses to a specific survey variable.

## Variable Information:
- Variable name: {variable_name}
- Question text: {question_text}
- Variable type: {variable_type}
- Domain/topic: {variable_domain}
- Sample response values: {sample_values}

## Response Categories/Options:
{categories_info}

## Sample Open-ended Responses (if available):
{sample_responses}

## Language Information:
{language_info}
IMPORTANT: This is a bilingual survey (French and English). Responses may be in either language.

## Your Task:
Generate an optimal prompt consisting of TWO parts:

### PART 1: SYSTEM MESSAGE
Create a system message that:
- Defines the AI assistant's role and expertise for this specific variable type
- Explains the task clearly (mapping open responses to codes)
- Provides domain-specific guidance relevant to this variable's topic
- Emphasizes returning only the numeric code
- Includes any special considerations for this variable type
- CRITICAL: Explicitly mentions handling both French and English responses
- Provides key French translations for common responses (oui=yes, non=no, etc.)
- Warns against coding valid French responses as Don't know

### PART 2: USER TEMPLATE
Create a user message template that:
- Uses placeholder variables: {{variable_name}}, {{question_text}}, {{options_block}}, {{open_response}}
- Is formatted clearly for easy reading
- Includes appropriate context for this variable type
- Follows this general structure but adapt the labels/sections as needed:

Variable: {{variable_name}}
Question: {{question_text}}

[Appropriate section title for the options]:
{{options_block}}

Open-ended response:
"{{open_response}}"

## Requirements:
- Be specific to this variable's domain and characteristics
- Consider the types of responses likely for this question
- Optimize for accuracy in mapping responses to the correct codes
- Keep instructions clear and concise
- Ensure the prompt will work well for the sample responses shown
- MUST handle bilingual responses correctly (French and English)
- Include guidance on common French political terms if relevant (e.g., Libéral = Liberal Party)
- Provide clear French-English mappings for agreement/disagreement terms

## Output Format:
IMPORTANT: Return ONLY valid JSON in this exact format. Do not include any other text or explanation:

{{
  "system_message": "Your system message here...",
  "user_template": "Your user template here..."
}}

Generate the optimized prompt now:
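
The mix of single and doubled braces in this prompt behaves like a format-string template: single-brace fields are filled per variable, while doubled braces survive as literal placeholders for Stage 2. The sketch below assumes Python's `str.format` semantics, which the source does not confirm; the variable name `carbon_tax` and the sample reply are hypothetical.

```python
import json

# Shortened excerpt of the meta-prompt. Single braces are filled now;
# doubled braces come out as literal single braces for the Stage 2 template.
META_PROMPT_EXCERPT = (
    "- Variable name: {variable_name}\n"
    "- Question text: {question_text}\n"
    "- Uses placeholder variables: {{variable_name}}, {{open_response}}\n"
)

filled = META_PROMPT_EXCERPT.format(
    variable_name="carbon_tax",  # hypothetical variable name
    question_text="The federal carbon tax should be maintained as an environmental measure",
)


def parse_generated_prompt(raw):
    """Parse the Stage 1 reply and check that both required keys are present."""
    data = json.loads(raw)
    missing = {"system_message", "user_template"} - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data["system_message"], data["user_template"]


# Hypothetical Stage 1 reply in the JSON shape the prompt asks for.
reply = '{"system_message": "You are a coding assistant.", "user_template": "Question: {question_text}"}'
system_message, user_template = parse_generated_prompt(reply)
```

Validating the keys before Stage 2 starts means a malformed Stage 1 reply fails early instead of producing thousands of badly prompted codings.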

Custom AI-Generated Prompt

Question: The federal carbon tax should be maintained as an environmental measure

### System Message

You are a specialized survey data coding assistant with expertise in Canadian environmental and political policy. Your task is to analyze an open-ended survey response and assign the single most appropriate numeric code from the provided list. The response is for a question about maintaining the federal carbon tax as an environmental measure.

CRITICAL INSTRUCTIONS:

1.  **Bilingual Processing:** This is a bilingual survey. Responses may be in English or Canadian French (FR-CA). You must accurately interpret both languages. Do NOT code a valid French response as 'Don't know' (Code 3) simply because it is in French.

2.  **Key Translations:** Pay close attention to these common terms:
    *   **Agreement:** Oui (Yes), D'accord (Agree), Pour (For), Favorable (Favorable).
    *   **Disagreement:** Non (No), Pas d'accord (Disagree), Contre (Against), Défavorable (Unfavorable).
    *   **Uncertainty:** Je ne sais pas / NSP (I don't know).
    *   **Qualifiers:** 'Totalement' or 'Fortement' indicate strong agreement/disagreement. 'Un peu' or 'Plutôt' indicate 'somewhat'.

3.  **Coding Nuance:** The question is about *maintaining* the tax. A response like "Yes, but we need to do more" ("Oui, mais on doit faire plus") still indicates agreement with *maintaining* the tax and should be coded as Agree (5 or 6). A simple "Oui" or "Yes" should be coded as Strong Agreement (6).

4.  **Output Format:** You must return ONLY the single numeric code corresponding to the best-fit category. Do not provide any explanation, text, or punctuation.

### User Template

Variable: {variable_name}
Question: {question_text}

Response Options and Codes:
{options_block}

Open-ended response to code:
"{open_response}"

### Response Options
1: Strongly disagree
3: Don't know/Prefer not to answer
4: Somewhat disagree
5: Somewhat agree
6: Strongly agree
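
Putting the user template and the response options together, one Stage 2 request and its reply check might look like the sketch below. The codes and labels are copied from this slide; the function names, the variable name `carbon_tax`, and the rule of rejecting any reply that is not a listed code are assumptions for illustration.

```python
# Codes and labels as listed on the slide.
OPTIONS = {
    1: "Strongly disagree",
    3: "Don't know/Prefer not to answer",
    4: "Somewhat disagree",
    5: "Somewhat agree",
    6: "Strongly agree",
}

USER_TEMPLATE = (
    "Variable: {variable_name}\n"
    "Question: {question_text}\n\n"
    "Response Options and Codes:\n"
    "{options_block}\n\n"
    "Open-ended response to code:\n"
    '"{open_response}"'
)

# Map the exact reply strings we accept to their integer codes.
VALID_REPLIES = {str(code): code for code in OPTIONS}


def build_user_message(variable_name, question_text, open_response):
    """Fill the user template, rendering the options as 'code: label' lines."""
    options_block = "\n".join(f"{code}: {label}" for code, label in OPTIONS.items())
    return USER_TEMPLATE.format(
        variable_name=variable_name,
        question_text=question_text,
        options_block=options_block,
        open_response=open_response,
    )


def validate_reply(reply):
    """Return the integer code if the reply is exactly a listed code, else None."""
    return VALID_REPLIES.get(reply.strip())


message = build_user_message(
    "carbon_tax",  # hypothetical variable name
    "The federal carbon tax should be maintained as an environmental measure",
    "Oui, mais on doit faire plus",
)
```

Returning `None` for anything outside the listed codes keeps stray text or unlisted numbers out of the consensus vote.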

Discussion & Implications

Thank You!

Questions?

Supplementary Materials